<u>UNIT-IV</u> <u>PROCESSOR</u> (12<sup>93</sup> Instruction Execution - Building a Data Path - 425 <sup>15</sup>Designing a Control unit - Hardwired controls <sup>16</sup>Microprogrammed Control - Pipelining - Data Hazard-Microprogrammed Control - Pipelining - Data Hazard-Control Hazarde (1<sup>3b</sup>)

INSTRUCTION EXECUTION \* sequence of operations required to execute one instruction. R1-30 R3 - 1000 Add (R3), RI -> Adds the contents of a memory location pointed to by RB to Register RI. Actions! 1. Fetch the instruction 2. Fetch the first operand hours 3. Perform the addition 4. Load the result into RI mart <u>Control sequence</u> for execution of the instruction Action Step PCout, MARin, Read, Select &, Add, Zin Ì Zout, PCin, Yin, WMFC 2 MDRout, IRis 3 RBout, IRin 4 Rlout, Yis, WMFC 5 MDRout, Select Y, Add, Zis Zout, Rlin, End Instruction execution proceeds as follows: Intruction Fetch Phase; Stept In × - the instruction fetch operation is initiated by loading the contents of the PC is to the MAR. and sending a Read request to the memory - The Select signal is set to select 4, which

Comple Te Function Com - Waiting for Memory

NMR



Figure 7.1 Single-bus organization of the datapath inside a processor.

causes the multiplexer Mux to select the constant \$1 - This value is added to the operand at input B, which is the contents of the PC, and the result is <u>step2</u>: stored in register Z. - The updated value is moved from register & back into the PC Step 3: - The word fetched from the memory is load into the IR. Step 4: - The instruction decoding circuit interprets the contents of the IR. - This enables the control circuitry to activate the control signals - The contents of register R3 are transferred to the MAR . - a memory read operation is initiated. Step 5: Then the contents of RI are transferred to register 4 to prepare for the addition operation. - When the Read operation is completed, the memory Step 6 operand is evailable in register RIDR and the addition operation is performed. - The contents of MOR are gated to the bus and thus also to the B input of the ALU and register Y is selected as the second input to the ALU by choosing Select Y. - The Sum is stored in register Z then transferred to RI. Step 7: \* The End signal causes a new enstruction fetch cycle. to begin by returning to step1.

Branch Instructions: \* A branch instruction replaces the contents of the PC with the branch target address. - The address is obtained by adding an offset X to the updated value of the PC Control sequence for an unconditional Branch instruction. Action Step PCout, MARin, Read, Select 4, Add, Zin 1 Zout, PCin, Yim, WMFC ð MDRout, IRin 3 Offset - field - of - Rout, Add, Zin 4 5 Zout, PCvin, End \* Processing starts with the fetch phase. - The phase ends when the instruction is load in the IR. \* The offset value is extracted from the IR by the instruction decoding circuit step4: Since the value of the updated PC is already in register Y, the offset x is galed onto the bus. and an addition operation is performed. The result, which is the branch target address, is Step 5 loaded into PC. \* The offset × used is the difference between the branch target address and the address immediately following the branch enstruction. The PC is incremented during the fetch phase, before knowing the type & instruction being executed.

Data Path: \* The data path is the path way that the data takes through the cpu \* The datapath consists of functional write that perform addition, subtraction, Logical AND, OR, shifting etc. Datapath Elements: The elements are: i) Instruction Memory ii) Program Counter (PC) iii) Registers IV) ALU V) Adder vi) Data Memory Unit vii) Sign Extension Unit viii) Mux/Multiplexer. Instruction Memory: 1) It is a memory unit to store the instructions X of a program and supply instructions given an address. > Instruction Address Instruction -Instruction Momory -> It is a combination Logic: - The output at any time reflects the contents



\* a register write must be explicitly indicated by asserting the write control signal. iv) ALU : \* It takes two 32-bit inputs \* produces a 32-bit result, as well as 1-bit signal if the result is 0. -> The operation to be performed by the ALU is controlled with the ALU operation signal, which will be 4 bits wide. - The overflow output will not be needed. 4 ALU operation Lero. ALU ALU result V) Adder: \* The adder is an ALU wired to always add its two 32-bit inputs and place the sum on ils output. \* To increment the PC to the address of the next instruction. - Cannot perform the other ALU functions >Add Sum Vi) Data Memory Unit: \* The memory unit is a state element with enputs for the i) address ii) write data - single output for the read result \* The data memory has read and write control signals.

- The merrory unit MemWrite needs a read figm) Read reading the value data Address 2 an invali address can cause problems. Data Memory Write data MemRoad Vii) Sign Extension Unit: \* It has a 16-bit input that a sign-extended into a 32-bit result appearing on the output - To increase the size of a data item 32 Signextend MUX/ Multiplexer : viii) \* It is also called as data selector \* selects from among several inputs based on the setting of its control lines.



DESIGNING A CONTROL UNIT \* To design the main control whit, three instruction classes are used. They are : 1. R-type instructions 2. Branch instructions 3. Load-Store instructions 1. R- type Instruction: funct Rd shart rt. rs 0 Field 5:0 10:6 31:26 25:21 20:16 15:11 Bit Positions \* opcode -> 0 (value - Zevo) \* Three register operands: sources : rs, xt destination : rd -\* ALU operations -> funct Is decoded by the ALU control Add, sub, of etc. \* shifting operation -> shamt (shift amount) 2. Branch instructions: 28 Field rt address 4 Bit positions 31:26 25:21 20:16 15:0 \* opeode >4 \* The registers rs and st are source registers that are compared for equality. \* 16 bit address field is sign extended, shifted and added to the PC to compute the branch target address

3. Load or Store Instruction: address field rt 35 01 43 23 Bit Positions 31:26 25:21 15:0 20:16 \* opcode -> 35ten (load) 43 ten (store). \* The register rs is the base register - It is added to the 16 bit address field to form the memory address register \* At is the destination, for the loaded value \* Load : \* st is the source register whose value \* Store: should be stored into memory. Simple Datapath with the control unit: \* The enput to the control unit is the 6-bit opcode gield from the enstruction. \* The output of the control unit consists of 8 signals. -> Three I bit signals -> used to control oncelliplexons. -> Three signals (Regwrite, MernRead and MernWrite) - controlling reads and writes in the register and data memory. > 1-bit signal (branch) > used to determine whether to possibly branch. > 2-bit control signal (ALUOP) > used for ALU. AND gate: - used to combine the branch control signal and the zero output from the ALU. - output controls the selection of the next PC.

## The simple datapath with the control unit



- The input to the control unit is the 6-bit opcode field from the instruction.
- The outputs of the control unit consist of
  - three 1-bit signals that are used to control multiplexors (RegDst, ALUSrc, and MemtoReg),
  - three signals for controlling reads and writes in the register file and data memory (RegWrite, MemRead, and MemWrite),
  - a 1-bit signal used in determining whether to possibly branch (Branch), and
  - a 2-bit control signal for the ALU (ALUOp).
- An AND gate is used to combine the branch control signal and the Zero output from the ALU; the AND gate output controls the selection of the next PC.

### i) Operation of the datapath for an R-type instruction

Ex:

add \$t1,\$t2,\$t3

Four steps to execute the instruction

- 1. The instruction is fetched, and the PC is incremented.
- 2. Two registers, \$t2 and \$t3, are read from the register file; also, the main control unit computes the setting of the control lines during this step.
- 3. The ALU operates on the data read from the register file, using the function code (bits 5:0, which is the funct fi eld, of the instruction) to generate the ALU function.
- 4. The result from the ALU is written into the register fi le using bits 15:11 of the instruction to select the destination register (\$t1).



The datapath in operation for an R-type instruction

## ii) The execution of a load word

Ex:

## lw \$t1, offset(\$t2)

#### Five steps :

- 1. An instruction is fetched from the instruction memory, and the PC is incremented.
- 2. A register (\$t2) value is read from the register file.
- 3. The ALU computes the sum of the value read from the register file and the sign-extended, lower 16 bits of the instruction (offset).
- 4. The sum from the ALU is used as the address for the data memory.
- 5. The data from the memory unit is written into the register fi le; the register destination is given by bits 20:16 of the instruction (\$t1).



#### The datapath in operation for a load instruction

The datapath in operation for a load instruction

- The control lines, datapath units, and connections that are active are highlighted.
- A store instruction would operate very similarly.
- The main diff erence would be that the memory control would indicate a write rather than a read, the second register value read would be used for the data to store, and the operation of writing the data memory value to the register fi le would not occur.

## iii) Operation of the branch-on-equal instruction

Ex:

beq \$t1, \$t2, offset

Four steps in execution:

- 1. An instruction is fetched from the instruction memory, and the PC is incremented.
- 2. Two registers, \$t1 and \$t2, are read from the register file.
- 3. Th e ALU performs a subtract on the data values read from the register file. The value of PC + 4 is added to the sign-extended, lower 16 bits of the instruction (offset) shift ed left by two; the result is the branch target address.
- 4. The Zero result from the ALU is used to decide which adder result to store into the PC.

The datapath in operation for a branch-on-equal instruction



HARDWIRED CONTROL \* To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence Two Approaches: i) Hardwired Control ii) Microprogrammed Control Control Unit Organization Reset Control Step CLK clock counter step decoder Ti Ta ... External INSI INS2 4 inputs Instruction IR decoder Encoder Condition Codes INSm End Run Control Signals \* Each step in the control sequence for the execution of an instruction is completed in one clock period. Counter : \* Used to keep track of the control steps \* Each state, or count, of this counter corresponds to one control step.

Determination of the required control signals: information: \* Contents of the control step counter \* contents of the enstruction requister \* contents of the condition code flags \* External input signals - MFC - interrupt requests Decoder/encoder: \* It is a combinational circuit that generates the required control inputs, depending on the state of all its inputs. Step Decoder: \* Provides a separate signal line for each step, or time slot, in the control sequence. Instruction decoder: \* The output of the instruction decoder consists of a separate line for each machine instruction \* for any instruction loaded in the IR, one of the output lines INS, through INSm is set to 1, and all other lines are set to 0. Encoder: \* The input signals to the encoder block are compined to generate the individual control signals Yin, PCout, Add, End and so on.

EX: of the Zin Control Signal for the Generation Processor = T, + T6. ADD + T4. BR Zin Add Branch - 16 TA TI Zin \* This signal is assorted during time slot Ti for all instructions, during T6 for an Add instruction, deving Ty for an unconditional branch instruction and so on - The logic function for Zin is derived from the control sequerees. \* A circuit that generates the End Control signal from End : The logic function End =  $T_7 \cdot ADD + T_5 \cdot BR + (T_5 \cdot N + T_4 \cdot \overline{N}) \cdot BRN +$ Ad <u>Greneration of the End Control signal</u>. N Branch <0, Branch - TA T5 T7 -T5 -... . . . End

\* starts a new instruction fetch cycle by resetting End signal: The control step counter to its starting value. RUN ! \* Control signal \* When set to 1, RUN causes the counter to the incremented by one at the end of every clock \* When set to 0, the counter stops counting. cycle. - This is needed whenever the WMFC signal is issued, to cause the processor to wait for the reply from the memory. +It can be viewed as a state machine that changes control hardware: from one state to another in every clock cycle, depending on the contents of the instruction register, the condition eader, and the external inputs \* The outputs of the state machine are the control signals. -> The sequence of operations carried out by the machine is determined by the wiring of the logic elements, hence the name . " hardwied ". \* A controller that uses this approach can operate at high speed. \* It has little flessebility \* The complexity of the instruction set it can implement is limited.

MICROPROGRAMMED CONTROL \* control signals are generated by a program similar to machine language programs. Terms: - A word whose individual bits represent the Control word: various control signals. \* Each of the control steps in the control sequence of an instruction défines à unique combination of 1s and os in the CW. Micro-routine: \* A sequence of control words corresponding to the control sequence of a machine instruction constitutes the micro-routine for that instruction + The individual control words in the micro-routine Micro instructions A The microsoutines for all instructions in the control Store: instruction set of a computer are stored in a special memory called the control store.

organization of the control Unit External inputs Starting and Condition branch IR address codes generator MPC Clock CW Control \* The control whit can generate the control signals for any instruction by sequentially reading the CWs of the corresponding micro routine from the control store. Micro Program counter (MPC): \* To read the control words sequentially from the control store. \* Every time a new instruction is loaded into the IR : IR, the output of the block labeled "stanting address generator" is loaded ento the pepe. \* The peper is then automatically incremented by dock ; the clock, causing successive ménoinstructions to be read from the control store. -> Hence, the control signals are delivered to various parts of the processor in the correct sequence.

\* Microinstructions: (i) Straightonward way: - To assign one bit position to each control signal. Drawback: - long microcristructions \* 42 control signals are needed. + 42 bits would be needed in each micro instruction. i) Binary coding Scheme. \* signals can be grouped so that all mutually exclusive signals are placed in the same geoup. - At most one microoperation per group is specified in any mêas instruction. \* Represent the signals within a group. Partial format for field-encoded microinstructions EX: FI F2 F3 F4 F5 F3(3bils) F5 (2 bits) FI(4bils) F2(8 bils) FA(Houls) oo No achon 000 : No transfer 600 : No transfer 0000 : No bransfer 0000 Add 0 1 ! Read OOL: MARIS 0001: Sub 0001; PCout 001: PCin 10: Write 0010 : MDRout 010 MDRin 010: 1Rin OIL TEMPIN 0011: Zout oll; Zin 100: Yin 100 : Roin 0100 ; Roout 1111: XOR 101: Rin 0101 : Rout 110. Rein 0110 : Roout 16 ALU 111 : R3in 0111 : R3out Junctions 1010 : TEMPout 1011: offsetout F6 F7 F8 F6 (1 bit) FT(1bit) F8 (1 bit) 0: continue 0: No oction O. Select Y 1:End 1: WMFC 1: Select 4

\* 20 bits are needed to store the patterns for the 42 signals

Encoded Scheme Minimally encoded Highly encoded Scheme Scheme Horizontal organization -> Many resources can be Vertical organization controlled with a single → Use compact code to micro instruction specify only a small number of control functions -> higher operating speed in each microinstruction -) allows parallel use of ->slower operating speed -> feurer bits are required resources - less hardware is needed for execution . Microprogram sequencing: \* The simple microprogram requires only straight forward sequential execution of microinstructions, except for the branch at the end of the fetch phase. - upe governs the sequencing \* Some branching capability within the microprogram can be introduced through special branch microinstruction - Standard <u>software techniques</u> can be used for writing micro programs. Dijadvantages i) Large total number of micro instructions and a large control store Bypass the Branching: Execution time is longer i) \* Branch Address Modification Using Bit-ORing -Use an OR gate to change ISB of address to 1 ii) & use 2 conditional branch microinstructions iii) \* To include '2 next address fields within a branch microuristy

Wide - Branch Addressing:

\* The enstruction decoder generates the starting address of the onicrosoutine that implements the instruction that has just been loaded into the IR.

IR contains the instruction, for which the instruction decoder generates the microinstruction.
The address cannot be loaded as is into the microprogram counter.
The Source operand of the instruction can be source operand of the instruction modes.

Specified is any of several addressing modes. - The bit-ORing technique can be used to modify the starting address generaled by the instruction decoder to reach the appropriate path

Use of WMFC: \* The WMFC signal means that the onicocinstruction may take several clock cycles to complete. - If the branch is allowed to happen in the first clock cycle, the microinstruction would be fetched and executed prematurely. > To avoid this problem, the branch must not take place until the memory transfer in progress is completed, that is, the WHFC signal must withibit any change in the contents of the microprogram counter during the waiting period.

Microinstructions with Next-Address Field. Branch microinstructions: - perform no useful operation in the datapath - they are needed only to determine the address of the next microvistruction. \* To include an address field as a part of every microinstruction to indicate the location of the next micro instruction to be fetched. - Address field of 12 bits is required. 1/6 of the control store capacity \* Each instruction contains the address of the next instruction - no need for a counter to keep track of sequential ad dress es. A the peper is replaced with a microinstruction address MAR: register (MAB), which is loaded from the next-address field in each mias instruction. Nicroinstruction IR condition. External Decoding circuits HAR 4 Control Store L MIR Next address Microinstruction decoder. Control Signals

The next address bits are fied through the OR gate: OR gates to the MAR, so that the address can be modified on the basis of the data in the IR, external inputs and condition codes. Decoding circuits: \* generate the starting address of a given microsoulisie on the basis of the opcode in the IR. Fig : a) Format for miceo instructions b) Ex: Implementation of the micronoutine using a next - microcinstruction address field. Prefetching Microcinstructions: \* Faster operation is achieved if the next microvistauctions & is prefetched while the current one is being executed. - the execution time can be overlapped with the fetch time. Difficulty: fetch must be repealed with correct address, which requires more complex hardware. Main function of Microprogrammed Control Emulation: - simple - Herible - relatively inexpensive execution of machine instruction. \* Programs written in the machine language of M2 can ner Computer MI. Mi emulates M2. \* Emulation allows to replace obsolete equipment with more up-to-date machines. X

2.5

# PIPELINING

\* Pipelining is an implementation technique Helism >expols por in which multiple instructions are overlapped the instruction in a requestion in fuction man. in execution. among - The pipelined approach takes much less time - The stages is pipelining are operating concurrently. inveganes Pipeline instruction execution: N. D. Brrutes MIPS instructions - Five steps: 1. Fetch instruction from mornory overstig 2. Read registers while decoding the instruction. Reading and decoding occur simultaneously. Execute the operation or calculate an address. Access an operand in data memory. 3. Write the result into a register. 4. 5. Total Time for each instruction calculation EX: Register Total Data Register Read ALU Instruction Instruction operation write Time Access Fetch class 10008 20005 800ps Load word (lw) 200 28 10008 200 ps 200 ps 200 ps 10008 Foops store word (sw) 20008 10008 600 08 200 08 e daei R-format(add, 10008 20008 sub, AND, OR, SIL 200 ps Boops 10.13 10008 20005 Branch(beg) on the pipelined processor: enstructions Time between Time between instructions nonpipeli enstructions pipelined Time between Number of pipe stages Jalency: / Pipeline The number of stages in a pipeline numbre of stages blu 2 instrus during execution.

Nonpipelined Execution

MIPS Instruction

Los Strottester

411. ethet (312

5 - ----

Program execution 1200 1400 1000 1800 Tune 1600 400 600 800 200 order (in instructions) lw \$1, 100(\$0) Instr Reg ALU Data Reg Fetch Data Reg Instr ALU ACCOM Reg 800ps Fetch Lw \$2,200(\$0) Instruction Fetch 200 ps Lw \$3,300(\$0) 800pB Pipelined Execution program Time exocution 400 600 200 ROD 1200 1000 1400 order (in instruction) Instruction Data Reg Reg Lw \$1,100(\$0) AU Acces Fetch lw \$2,200(\$0) Instr Data Reg 1 Pag ALU Acces Fetch lw \$3,300(\$0) Data Inst Reg ALU Access Feth \* Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction. Designing Instruction sets for Pipelining 1. All MIPS instructions are the same length 2. MIPS has only a few instruction formats. 3. Memory operands only appear in loads or stores in MIPS 4. Operands must be aligned in memory.



FIGURE 4.44 Traditional multiple-clock-cycle pipeline diagram of five instructions in Figure 4.43.



FIGURE 4.45 The single-clock-cycle diagram corresponding to clock cycle 5 of the pipeline in Figures 4.43 and 4.44. As you can see, a single-clock-cycle figure is a vertical slice through a multiple-clock-cycle diagram.

\* Pipeline Hazards: \* Situations in pipelining when the next instruction cannot execute in the following clock cycle. - These events are called hazards. Three types : 1. Structural Hazards 2. Data Hazards 3. Control Hazards \* When a planned instruction cannot execute in the proper Ostructural Hazards: clock cycle because the hardware does not support the combination of instructions that are set to execute. \* Instruction is accessing data from memory. \* At the same time, another instruction is fetching from the same memory - structural hazard occur. -> Need for two memories. 2 Data Hazards: -X When a planned instruction cannot execute in the proper clock cycle because data that is nooded to execute the instruction is not yet available occur when the pipeline must be stalled because one step must wait for another to complete. \* In a computer pipeline, data haraveds arise from the dependence of one instruction on an earlier one that is still in the pipeline. <u>+x</u>: add \$ \$0, \$to, \$t1 sub \$ t2, \$ so, \$t3

Solution: D Forwarding / by passing: \* Adding extra parduare to retrieve the missing item early from the internal resources. -> A method of resolving a data hazard by retrieving The missing data element from internal batters rather than waiting for it to arrive from programmer visible megisters on memory. Graphical Representation of the instruction pipeline 1000 800 600 400 200 Time MEM EX add \$50, \$10, \$11 - Instruction Fetch - box representing mistruction manny Five stages: IF - Instruction Decode/ \_ register file being read. 12 Register File Read EX - Execution stage - representing ALU MEM - Memory Access stage - box representing data memory WB - Write back stage - register file being written \* shading indicates the element is used by the instruction MEM - white background Shading on the right half - the element is read. left half - written in That stage Ciraphical representation of forwarding 1000 800 600 Program Time MEM WB × F DE (in instr.) and \$ 80,\$to,\$t1 [] WB; MEMI TD IF sub \$t2,\$\$0,\$t3

\* Connection to forward the value in \$100 abter the execution stage of the add eristruction as input to the execution stage the sub instruction. -> Forwarding works very well. - It cannot prevent all pipeline stalls. Load-use data hazard: \* Specific form of data hazard - data being loaded by a load instruction has not yet become available when it is needed by another instruction. load of \$ 80 instead of add > the desired data would be available only after the fourth stage of the first instruction - too late for the input of the 3rd stage of sub. Pipeline Stall: / Bubble: . \* Needs when an R-format instruction following 2 a load tries to use the data. 1000 1200 1400 800 600 400 200 Program aution Time order WB MEM lu \$80, 20(\$H) bubbles (buble) (bubble ; [ bubble } bubble } MEN D Sub \$t2, \$80, \$t3 \* A stall initiated in order to resolve a hazard. -> Reorders code to try to avoid load-use pipeline stalls Disadvantages of Forwarding \* Forwarding is harder if there are multiple results to forward per enstruction or they need to write a result (3) Reordering the code

Control Hazards: /branch hazard \* Arising from the need to make a decision based on the results 2 one instruction while others are executing. > When the proper instruction cannot execute in the proper Pipeline clock cycle because the instruction that was Betched is not the one that is needed; - The flow & instruction address is not what the pipeline espected. Decision task - branch instruction: \* Fetching the instruction following the branch. Soln: - To stall immediately after fetching a Solution: brarch, waiting until the pipeline determines The outcome & the branch and knows what instruction address to fetch from. conditional branch Stalling every on 200 1000 600 800 400 200 hograp Time Execution order (m instris) Data Reg Instruction Reg ALU Acces add \$4,\$5,\$b Fetch Data Costructor Reg beg \$1,\$2,40 Reg ALU Access Fetch (bubble Emples Emples (bubble) (bubble) month Access Rog ALU Reg Fetch or \$7,\$8,\$9 > One-stage pipeline stall, or bubble, after the branch. Prediction \* use prediction to handle branches. - to predict always that branches will be antaken \* only when branches are taken does the pipeline stall.

Solution 2: Branch prediction: \* A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome. The branch is not taken EX: 1400 1200 1000 600 800 program 400 200 excusion Time order Data Instruction Rog (in instr) Ray ALU access add \$4,\$5,\$6 Fetch Data Instruction Rag Rog ALU Access Fetch beq, \$1, \$2, 40 0 Instuction Pag ALU Rog ALU Fetch lw \$3,300 (\$0) The branch is taken 400 1200 400 600 800 1000 program 200 oxpution Time order Dala (in inst) Insprection Reg ALU Reg Access Fetch add \$4,\$5,\$6 Data Reg Instruction ALU Reg access beg, \$1,\$2,40 Fetch bubb (bubble) (bubble) bubble. bubble) Data ALU Instruction Reg Pg acces > 65 \$7,\$8,\$9 Fetch Solution 3: - Delayed decision / delayed branch. - used by MIBS architecture. \* The delayed branch always executes the rest sequential instruction, with the branch taking place after that one instruction delay. - used branches are short tong i this based branch prediction is used

DATA HAZARDS: Forwarding Versus Stalling Data Hazarda: Also called a pipeline data hazard. When a planned instruction cannot execute in the proper clock cycle Decause data that is reeded to execute the instructions is not yet available. Data Hazards are obstacles to pipelined execution # Register \$2 written by sub Sub \$2, \$1, \$3 # 1st operand (\$2) depends on sub -aud \$12, \$2, \$5 # and operand (\$2) depends on Sub ол \$13, \$6, \$2 #1st (\$2) + 2nd (\$2) depend on sub add \$ 14,\$2,\$2 # Base (\$2) depends on sub SW \$15, 100 (\$2) The last four instructions are all dependent on the result in register \$2 of the first instruction. If register \$2 had the value to Defore the pubtract instruction and -20 afterwards, the programmer intends that - 20 will be used in the following negister that refer to register \$2.



**FIGURE 4.52 Pipelined dependences in a five-instruction sequence using simplified datapaths to show the dependences.** All the dependent actions are shown in color, and "CC 1" at the top of the figure means clock cycle 1. The first instruction writes into \$2, and all the following instructions read \$2. This register is written in clock cycle 5, so the proper value is unavailable before clock cycle 5. (A read of a register during a clock cycle returns the value written at the end of the first half of the cycle, when such a write occurs.) The colored lines from the top datapath to the lower ones show the dependences. Those that must go backward in time are *pipeline data hazards*.

pelined dependencies instruction sequence datapaths SIMP dependencies result of the sub instruction unless the or later occurred during clock cycle 5 read \* The lines from the top datapath to the lower ones show the dependencies

\* Fig 2 shows the dependencies Detween the pipeline registers and the inputs to the ALU for the same code sequence as in Fig1. \* The change is that the dependencies Degin from a pipeline register, rather than waiting for the WB stage to write the register file. \* Thus, the required data exists in time for later instructions, with the zigeline registers holding the data to be forwarded. The conditions for detecting hazards and the control alignals to resolve them. 1 EX hazard:if LEX/MEN. RegNoite and CEX/MEM. Register Rd =0) and (EX/MEM. Register Rd = ID/EX. Register Rs)) Forward A=10 if (EX/MEM. Regwrite and (EX/MEM. RegisterRd =0) and (EX/MEH. RegisterRd = ID/EX. Register Rt)) Forward B=10



FIGURE 4.53 The dependences between the pipeline registers move forward in time, so it is possible to supply the inputs to the ALU needed by the AND instruction and OR instruction by forwarding the results found in the pipeline registers. The values in the pipeline registers show that the desired value is available before it is written into the register file. We assume that the register file forwards values that are read and written during the same clock cycle, so the add does not stall, but the values come from the register file instead of a pipeline register. Register file "forwarding"—that is, the read gets the value of the write in that clock cycle—is why clock cycle 5 shows register \$2 having the value 10 at the beginning and -20 at the end of the clock cycle. As in the rest of this section, we handle all forwarding except for the value to be stored by a store instruction.

MEM hazard if (NEM/WB. Rog Noite and (MEM/WB - Register Rd = 0) and (MEM/WB. Register Rd=ID/Ex. Register Rs)) For WardA=01 if (MEM/WB. Reg Write and (MEM/WB Register Rd 70) and (MEM/WB. Register Rd = ID/FX. Register R+)) Forward B = 01

Hazards and Stalls: ata ( JD/Ex. Hem Read and 1 ((ID/EX. Register Rt = IF/ID. Register R) or (ID/EX. Register Rt = IF/ID. Register Rt)) -1 Stall the pipeline. if (ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))) 'stall the pipeline Time (in clock cycles) CC 1 CC 2 CC 5 CC 6 CC 7 CC 3 CC 4 CC 8 CC 9 Program execution order (in instructions) lw \$2, 20(\$1) and \$4, \$2, \$5 IM DM or \$8, \$2, \$6 IM DM Reg add \$9, \$4, \$2 DM Reg slt \$1, \$6, \$7 IM Reg

**FIGURE 4.58** A pipelined sequence of instructions. Since the dependence between the load and the following instruction (and) goes backward in time, this hazard cannot be solved by forwarding. Hence, this combination must result in a stall by the hazard detection unit.

6 Hazards and stalls: Data when an instruction tries to read a register following a load instruction that writes the same register. Not An instruction that does operation no change state. to - 914 M Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 CC 10 Program execution order (in instructions) lw \$2, 20(\$1) bubble Regi and becomes nop and \$4, \$2, \$5 DM legi

**FIGURE 4.59** The way stalls are really inserted into the pipeline. A bubble is inserted beginning in clock cycle 4, by changing the and instruction to a nop. Note that the and instruction is really fetched and decoded in clock cycles 2 and 3, but its EX stage is delayed until clock cycle 5 (versus the unstalled position in clock cycle 4). Likewise the OR instruction is fetched in clock cycle 3, but its ID stage is delayed until clock cycle 5 (versus the unstalled clock cycle 4 position). After insertion of the bubble, all the dependences go forward in time and no further hazards occur.

or \$8, \$2, \$6

add \$9, \$4, \$2





**FIGURE 4.60** Pipelined control overview, showing the two multiplexors for forwarding, the hazard detection unit, and the forwarding unit. Although the ID and EX stages have been simplified—the sign-extended immediate and branch logic are missing—this drawing gives the essence of the forwarding hardware requirements.

Hazards: -ontrol

Inanch

\* Control hazards voccurs when we execute the in pipeline process. instruction Iranch

\* control thozards also called ilranch thazards instruction cannot execute in the moner the when cycle **clock** maper

\* An instruction must be fetched at every to sustain the pipeline. cycle clock

\* The delay in determining the proper yetch is called control chazard or at instruction hazard.



FIGURE 4.61 The impact of the pipeline on the branch instruction. The numbers to the left of the instruction (40, 44, ...) are the addresses of the instructions. Since the branch instruction decides whether to branch in the MEM stage-clock cycle 4 for the beq instruction above-the three sequential instructions that follow the branch will be fetched and begin execution. Without intervention, those three following instructions will begin execution before beq branches to 1 w at location 72. (Figure 4.31 assumed extra hardware to reduce the control hazard to one clock cycle; this figure uses the nonoptimized datapath.)

to revolue control hazards. Methods The methods to resolue control chazards are () \* Branch Not Jaken (2) \* Reducing the delay of branches (3) \* Dynamic Branch Prediction. () Branch Not Jaken: \* If the Iwarch is taken then the instructions that are being fetched and decoded must the direct \* After discarding the execution continues at the Ivranch target. \* Discarding instructions means we must be able to yearsh instructions in the IF, 1D and EX stages of the pipeline. \* Flush is a method used to discard instructions in a pipeline usually due to an inexpected

www.

\* ty wanch not taken means no need to discard any instructions and pipelining will resecute the instructions continuously. \* Branches are itaken Ethen only pipeline thas istalled so by reducing the idelay of branches we can improve the performance of pipelining process in this condition.

Reducing the delay of Branches:.

\* One way to improve branch performance is to reduce the cost of the taken branch.

\* The next PC you a branch is relected in the MEM stage only but if we nove the branch execution cearlier in the pipeline then yeurer instructions need to be glurhed.

\* Moving Irranch execution earlier in the pipeline will increase the speed of performance.

\* When a more complex branch decision is required, a reparate instruction that uses an ALV to perform a comparison is required - a vituation that is isimilar to the use of condition codes for

wanches.

\* Moving the Irranch decision up requires two. vactions to occur earlier. They are > computing the branch target address > Evaluating the branch decision. \* The easy part of the change is to move up the branch address calculation. \* Now, just move the branch adder from the EX stage to ID stage. \* The harder part is the branch decision itself. For branch equal, we would compare two registers read during 10 stage to see if they are equal. \* Equality is tested by using exclusive bRing. \* For example, to implement Irranch on equal, we will need to forward vienelts to the requality test elogic that operates during ID. There are two complicating factors. During 1D, we must decode the instruction, ideaide whether supports to the equality unit is needed.

Forwarding the operands tog branches are handled My ALV forwoording logic unit.

Note that supposed vource operands of a branch can come from either the ALV [MEM Or MEM]WB pipeline latches.

2) The values in a branch comparison are needed adwing 1D stage but may the produced later in time, it is possible to occur data chazard and stall will be needed.

To overcome these difficulties we can move the Ivranch EX to the 1D stage, because it reduces the penalty of a Ivranch to only one instruction if the branch is taken.

Dynamic Branch Prediction.

(3)

\* Dynamic Ivranch prediction is the prediction of Ivranches at vurtime using vurtime information \* One approach is to look up the address of the instruction to see if a branch was taken the last time this instruction was resecuted, rand, if so, to begin yetching new instructions from the vame place as the last time. This lechnique called dynamic branch predition. ùs

\* One implementation of that approach is a elugger or ibranch chistory table Ivranch yrrediction

\* A branch prediction luffer is a ismall that is indexed by the lower portion memory address of the branch instruction and that of the whether the one or more bits indicating contains branch

was viecently taken or not



FIGURE 4.63 The states in a 2-bit prediction scheme. By using 2 bits rather than 1, a branch that strongly favors taken or not taken-as many branches do-will be mispredicted only once. The 2 bits are used to encode the four states in the system. The 2-bit scheme is a general instance of a counter-based predictor, which is incremented when the prediction is accurate and decremented otherwise, and uses the mid-point of its range as the division between taken and not taken.

## The final datapath and control

